Abstract

Satellite image classification is a challenging problem that lies at the
crossroads of remote sensing, computer vision, and machine learning. Due
to the high variability inherent in satellite data, most current object
classification approaches are not suitable for handling satellite datasets.
Progress in satellite image analytics has also been inhibited by the lack of
a single labeled high-resolution dataset with multiple class labels. The
contributions of this paper are twofold: (1) we present two new satellite
datasets called SAT-4 and SAT-6, and (2) we propose a classification
framework that extracts features from an input image, normalizes them, and
feeds the normalized feature vectors to a Deep Belief Network for
classification. On the SAT-4 dataset, our best network produces a
classification accuracy of 97.95% and outperforms three state-of-the-art
object recognition algorithms, namely Deep Belief Networks, Convolutional
Neural Networks and Stacked Denoising Autoencoders, by ~11%. On SAT-6, it
produces a classification accuracy of 93.9% and outperforms the other
algorithms by ~15%. Comparative studies with a Random Forest classifier show
the advantage of an unsupervised learning approach over traditional
supervised learning techniques. A statistical analysis based on the
Distribution Separability Criterion and Intrinsic Dimensionality Estimation
substantiates the effectiveness of our approach in learning better
representations for satellite imagery.
1 Introduction


Deep Learning has gained popularity over the last decade due to its ability to
learn data representations in an unsupervised manner and generalize to unseen
data samples using hierarchical representations. The most recent and best-known
Deep Learning model is the Deep Belief Network [?]. Over the last decade,
numerous breakthroughs have been made in the field of Deep Learning; a notable
one is [22], where a locally connected sparse autoencoder was used to detect
objects in the ImageNet dataset [?], producing state-of-the-art results. In [?],
Deep Belief Networks were used for modeling acoustic signals and were shown to
outperform traditional approaches using Gaussian Mixture Models for Automatic
Speech Recognition (ASR). They have also been found useful in hybrid learning
models for noisy handwritten digit classification [?]. Another closely related
approach, which has gained much traction over the last decade, is the
Convolutional Neural Network [?]. This has been shown to outperform Deep Belief
Networks in classical object recognition tasks like MNIST [39] and CIFAR [20].

A related and equally hard problem is satellite* image classification. It
involves terabytes of data and significant variations due to conditions in data
acquisition, pre-processing and filtering. Traditional supervised learning
methods like Random Forests [?] do not generalize well for such a large-scale
learning problem. A novel classification algorithm for detecting roads in aerial
imagery using Deep Neural Networks was proposed in [26]. Detecting various land
cover classes in general is harder, because land cover types such as trees,
grasslands, barren lands and water bodies exhibit significantly higher
intra-class variability than roads. Also, in [26], the authors used a window of
size 64x64 to derive contextual information. For our general classification
problem, a 64x64 window is too big a context, covering a total area of
64m x 64m. A tree canopy or a grassy patch can typically be much smaller than
this area, and hence we are constrained to use a contextual window with a
maximum dimension of 28m x 28m.

Traditional supervised learning approaches require carefully selected
handcrafted features and substantial amounts of labeled data. On the other hand,
purely unsupervised approaches are not able to learn the higher-order
dependencies inherent in the land cover classification problem. So, we propose a
combination of handcrafted features that were first used in [14] and an
unsupervised learning framework using a Deep Belief Network [?] that can learn
data representations from large amounts of unlabeled data.

There has been limited research in the field of satellite image classification
due to a dearth of labeled satellite image datasets. The most well-known labeled
satellite dataset is the NLCD 2006 [38], which covers the entire globe and
provides a spatial resolution of 30m. However, at this resolution, it becomes
extremely difficult to distinguish between various landcover types. A
high-resolution dataset acquired at a spatial resolution of 1.2m was used in
[26]. However, the total area covered by the datasets, namely URBAN1 and URBAN2,
was ~600 square kilometers, which included both training and testing datasets.
The labeling was also available only for roads. Satellite/airborne image
classification at a spatial resolution of 1m was addressed in [?]. However, that
work performed only tree-cover delineation, by training a binary classifier
based on Feedforward Backpropagation Neural Networks.

*Note that we use the terms satellite and airborne interchangeably in this paper
because the extracted features and learning algorithms are generic enough to
handle both satellite and airborne datasets.

The main contributions of our work are twofold. (1) We first present two
labeled datasets of airborne images, SAT-4 and SAT-6, covering a total area of
~800 square kilometers, which can be used to further research and investigate
the use of various learning models for airborne image classification. Both SAT-4
and SAT-6 are sampled from a much larger dataset [40], which covers the whole of
the continental United States and can be used to create labeled landcover maps.
Such maps can then be used for applications such as measuring ground carbon
content or estimating the total area of rooftops for solar power generation.

(2) Next, we present a framework for the classification of satellite/airborne
imagery that a) extracts features from the image, b) normalizes the features,
and c) feeds the normalized feature vectors to a Deep Belief Network for
classification. On the SAT-4 dataset, our framework outperforms three
state-of-the-art object recognition algorithms (Deep Belief Networks,
Convolutional Neural Networks and Stacked Denoising Autoencoders) by ~11% and
produces an accuracy of 97.95%. On SAT-6, it produces an accuracy of 93.9% and
outperforms the other algorithms by ~15%. We also present a statistical analysis
based on the Distribution Separability Criterion and Intrinsic Dimensionality
Estimation to justify the effectiveness of our feature extraction approach in
obtaining better representations for satellite data.

2 Dataset†

Images were extracted from the National Agriculture Imagery Program (NAIP [40])
dataset. The NAIP dataset consists of a total of 330,000 scenes spanning the
whole of the Continental United States (CONUS). We used the uncompressed digital
Ortho quarter quad tiles (DOQQs), which are GeoTIFF images whose area
corresponds to the United States Geological Survey (USGS) topographic
quadrangles. An average image tile is ~6000 pixels wide and ~7000 pixels high
and measures around 200 megabytes. The entire NAIP dataset for CONUS is ~65
terabytes. The imagery is acquired at a ground sample distance (GSD) of 1 meter.
The horizontal accuracy lies within 6 meters of ground control points
identifiable from the acquired imagery [?]. The images consist of 4 bands: red,
green, blue and Near Infrared (NIR). In order to maintain the high variance
inherent in the entire NAIP dataset, we sample image patches from a multitude of
scenes (a total of 1500 image tiles) covering different landscapes like rural
areas, urban areas, densely forested areas, mountainous terrain, small to large
water bodies, agricultural areas, etc., spread over the whole state of
California. An image labeling tool developed as part of this study was used to
manually label uniform image patches belonging to a particular landcover class.
Once labeled, 28x28 non-overlapping sliding window blocks were extracted from
the uniform image patch and saved to the dataset with the corresponding label.
We chose 28x28 as the window size to maintain a significantly bigger context, as
pointed out by [?], and at the same time not to make it so big as to drop the
relative statistical properties of the target class conditional distributions
within the contextual window. Care was taken to avoid interclass overlaps within
a selected and labeled image patch. Sample images from the dataset are shown in
Figure 1.

†The SAT-4 and SAT-6 datasets are available at the web link [42].

Figure 1: Sample images from the SAT-6 dataset
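
The block extraction step described above can be sketched as follows. This is
a minimal illustration and not the labeling tool used in the study: the
function name `extract_patches`, the nested-list image representation and the
choice to discard partial blocks at the borders are all assumptions.

```python
def extract_patches(region, size=28):
    """Split a 2-D grid (list of rows) into non-overlapping size x size blocks.

    Partial blocks at the right and bottom edges are discarded (an assumed
    convention; the paper does not say how borders were handled).
    """
    rows, cols = len(region), len(region[0])
    patches = []
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            # Slice `size` rows, then `size` columns from each of them.
            patch = [row[left:left + size] for row in region[top:top + size]]
            patches.append(patch)
    return patches


# Hypothetical 60 x 90 single-band region; each cell stores its flat index.
region = [[r * 90 + c for c in range(90)] for r in range(60)]
patches = extract_patches(region)
```

With a 60 x 90 region, two block rows and three block columns fit, so six
28x28 patches are produced and the leftover border pixels are dropped.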

2.1 SAT-4 

SAT-4 consists of a total of 500,000 image patches covering four broad land
cover classes: barren land, trees, grassland and a class that consists of all
land cover classes other than the above three. 400,000 patches (four-fifths of
the total dataset) were chosen for training and the remaining 100,000
(one-fifth) were chosen as the testing dataset. We ensured that the training and
test datasets belong to disjoint sets of image tiles. Each image patch is size
normalized to 28x28 pixels. Once generated, both the training and testing
datasets were randomized using a pseudo-random number generator.
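
A split that keeps training and test patches on disjoint image tiles, followed
by shuffling, might look like the sketch below. The tuple layout
`(tile_id, patch, label)`, the function name `split_by_tile` and the fixed seed
are hypothetical. Note that this sketch splits at the tile level, so the
patch-level ratio is only approximately four-fifths to one-fifth; the paper
does not describe how the exact ratio was obtained.

```python
import random


def split_by_tile(patches, test_fraction=0.2, seed=0):
    """Split (tile_id, patch, label) records so no tile appears in both sets."""
    tiles = sorted({tile_id for tile_id, _, _ in patches})
    rng = random.Random(seed)
    rng.shuffle(tiles)
    n_test = max(1, round(len(tiles) * test_fraction))
    test_tiles = set(tiles[:n_test])
    train = [p for p in patches if p[0] not in test_tiles]
    test = [p for p in patches if p[0] in test_tiles]
    # Randomize the order of the samples within each set.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical data: 10 tiles with 5 labeled patches each.
records = [(tile, None, "trees") for tile in range(10) for _ in range(5)]
train_set, test_set = split_by_tile(records)
```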
2.2 SAT-6


SAT-6 consists of a total of 405,000 image patches, each of size 28x28, covering
6 landcover classes: barren land, trees, grassland, roads, buildings and water
bodies. 324,000 images (four-fifths of the total dataset) were chosen as the
training dataset and 81,000 (one-fifth) were chosen as the testing dataset.
Similar to SAT-4, the training and test sets were selected from disjoint NAIP
tiles. Once generated, the images in the dataset were randomized in the same way
as for SAT-4. The specifications for the various landcover classes of SAT-4 and
SAT-6 were adopted from those used in the National Land Cover Data (NLCD)
algorithm [?].

3 Investigation of various Deep Learning Models

3.1 Deep Belief Network 

A Deep Belief Network (DBN) consists of multiple layers of stochastic latent
variables trained using an unsupervised learning algorithm, followed by a
supervised learning phase using feedforward backpropagation Neural Networks. In
the unsupervised pre-training stage, each layer is trained using a Restricted
Boltzmann Machine (RBM). Unsupervised pre-training is an important step in
solving a classification problem with terabytes of data and high variability. A
DBN is a graphical model [19] where neurons of the hidden layer are
conditionally independent of one another for a particular configuration of the
visible layer and vice versa. A DBN can be trained layer-wise by iteratively
maximizing the conditional probability of the input vectors or visible vectors
given the hidden vectors and a particular set of layer weights. As shown in [?],
this layer-wise training can help in improving the variational lower bound on
the probability of the input training data, which in turn leads to an
improvement of the overall generative model.

We first provide a formal introduction to the Restricted Boltzmann Machine.
The RBM can be denoted by the energy function:

$$ E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i} \sum_{j} v_i w_{ij} h_j \tag{1} $$

where the RBM consists of a matrix of layer weights W = (w_{ij}) between the
hidden units h_j and the visible units v_i. The a_i and b_j are the bias weights
for the visible units and the hidden units respectively. The RBM takes the
structure of a bipartite graph and hence it only has inter-layer connections
between the hidden and visible layer neurons but no intra-layer connections
within the hidden or visible layers. So, the activations of the visible unit
neurons are mutually independent for a given set of hidden unit activations and
vice versa [?]. Hence, by setting either h or v constant, we can compute the
conditional distribution of the other as follows:

$$ P(h_j = 1 | v) = \sigma\Big(b_j + \sum_{i=1}^{m} w_{ij} v_i\Big) \tag{2} $$

$$ P(v_i = 1 | h) = \sigma\Big(a_i + \sum_{j=1}^{n} w_{ij} h_j\Big) \tag{3} $$

where m and n are the numbers of visible and hidden units and \sigma denotes the
logistic sigmoid function:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{4} $$

The training algorithm maximizes the expected log probability assigned to the
training dataset V. So if the training dataset V consists of the visible vectors
v, then the objective function is as follows:

$$ \arg\max_{W} \; \mathbb{E}\Big[ \sum_{v \in V} \log P(v) \Big] \tag{5} $$

An RBM is trained using the Contrastive Divergence algorithm [?]. Once trained,
the DBN can be used to initialize the weights of the Neural Network for the
supervised learning phase [?].
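
To make the conditional distributions and the Contrastive Divergence update
concrete, the following is a minimal sketch of a single CD-1 step for a binary
RBM on one training vector. It is an illustration under stated assumptions, not
the training code used in this work: the function names, the learning rate, the
use of reconstruction probabilities instead of sampled visible states, and the
per-sample (non-batched) update are all choices made for brevity.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij * v_i)
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij * h_j)
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_step(v0, W, a, b, lr=0.1, rng=random):
    """One CD-1 update for a single visible vector; W, a, b change in place."""
    ph0 = hidden_probs(v0, W, b)                           # positive phase
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]   # sample hidden units
    v1 = visible_probs(h0, W, a)                           # reconstruction
    ph1 = hidden_probs(v1, W, b)                           # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1
```

Here `W[i][j]` connects visible unit i to hidden unit j, and `a`, `b` are the
visible and hidden biases. Stacking several such RBMs, each trained on the
hidden activations of the previous one, gives the greedy layer-wise
pre-training referred to in the text.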

Next, we investigate the classification accuracy of various architectures of DBN 
on both SAT-4 and SAT-6 datasets. 

3.1.1 DBN Results on SAT-4 & SAT-6 

To investigate the performance of the DBN, we experiment with both big and deep
neural architectures. This is done by varying the number of neurons per layer as
well as the total number of layers in the network. Our objective is to
investigate whether the more complex features learned in the deeper layers of
the DBN are able to provide the network with the discriminative power required
to handle higher-order texture features typical of satellite imagery data. The
results from the DBN for various network architectures for SAT-4 and SAT-6 are
enumerated in Table 1. Each network was trained for a maximum of 500 epochs and
the network state with the lowest validation error was used for testing.
Regularization is done using L2-norm regularization. It can be seen from the
table that for both SAT-4 and SAT-6, the classifier accuracy initially improves
and then falls as more neurons or layers are added to the network.

3.2 Convolutional Neural Network 

The Convolutional Neural Network (CNN), first introduced in [?], is a
hierarchical model inspired by the human visual cortical system [?]. It was
significantly improved and applied to document recognition in [?]. A committee
of 35 convolutional neural nets with elastic distortions and width normalization
[9] has produced state-of-the-art results on the MNIST handwritten digits
dataset. A CNN consists of a hierarchical representation using convolutional
layers and fully connected layers, with non-linear transformations and feature
pooling.


Network Arch.             Classifier Accuracy   Classifier Accuracy
Neurons/layer [Layers]    SAT-4 (%)             SAT-6 (%)
----------------------------------------------------------------------
100 [2]                   79.74                 68.51
100 [3]                   81.78                 76.47
100 [4]                   79.802                74.44
100 [5]                   62.776                63.14
500 [2]                   68.916                60.35
500 [3]                   71.674                61.12
500 [4]                   65.002                57.31
500 [5]                   64.174                55.78

Table 1: Classification Accuracy of DBN with various architectures on SAT-4 and
SAT-6

CNNs also include local or global pooling layers. Pooling can be implemented
in the form of subsampling, averaging, max-pooling or stochastic pooling. Each
of these pooling architectures has its own advantages and limitations, and
numerous studies investigate the effect of different pooling functions on the
representation power of the model ([31], [30]). A very important feature of the
Convolutional Neural Network is weight sharing in the convolutional layers, so
that the same filter bank is applied to all pixels in a particular layer. This
generates sparse networks that can generalize well to unseen data samples while
maintaining the representational power inherent in deep hierarchical
architectures.

We investigate the use of different CNN architectures for SAT-4 and SAT-6 as 
detailed below. 

3.2.1 CNN Results on SAT-4 & SAT-6 

For CNN, we vary the number of feature maps in each layer as well as the total
number of convolutional and subsampling layers. The results from various network
configurations with increasing numbers of maps and layers are enumerated in
Table 2. For the experiments, we used both 3x3 and 5x5 kernels for the
convolutional layers and 3x3 averaging and max-pooling kernels for the
sub-sampling layers. We also use overlapping pooling windows with a stride size
of 2 pixels. The last sub-sampling layer is connected to a fully-connected layer
with 64 neurons. The output of the fully-connected layer is fed into a 4-way
softmax function that generates a probability distribution over the 4 class
labels of SAT-4, and into a 6-way softmax for the 6 class labels of SAT-6. In
Table 2, the "Ac-Bs(n)" notation denotes that the network has a convolutional
layer with A feature maps followed by a sub-sampling layer with a kernel of size
BxB. 'n' denotes the type of pooling function in the sub-sampling layer: 'a'
denotes average pooling while 'm' denotes max-pooling. From the table, it can be
seen that the smallest networks consistently produce the best results. Also, for
both SAT-4 and SAT-6, using networks with convolution kernels of size 3x3 leads
to a significant drop in classifier accuracy. The biggest networks with 50 maps
per layer also exhibit a significant drop in classifier accuracy.


Network Architecture                      Accuracy     Accuracy
(Convolution kernel size)                 SAT-4 (%)    SAT-6 (%)
------------------------------------------------------------------
6c-3s(a)-12c-3s(m) (5x5)                  86.827       79.063
18c-3s(a)-36c-3s(m) (5x5)                 82.325       78.704
6c-3s(a)-12c-3s(m)-12c-3s(m) (5x5)        81.907       76.963
50c-3s(a)-50c-3s(m)-50c-3s(m) (5x5)       73.85        75.689
6c-3s(a)-12c-3s(m) (3x3)                  73.811       54.385
6c-3s(m)-12c-3s(m) (5x5)                  85.612       77.636

Table 2: Classification Accuracy of CNN with various architectures on SAT-4 and
SAT-6
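
The "Ac-Bs(n)" layer notation used in Table 2 is regular enough to be parsed
mechanically. The sketch below is only an illustration of how the notation
reads; the function name `parse_architecture` and the tuple layout of the
result are hypothetical and not part of the original experiments.

```python
import re

# "Ac" is a convolutional layer with A feature maps; "Bs(n)" is a BxB
# sub-sampling layer with pooling type n ('a' = average, 'm' = max).
LAYER = re.compile(r"(\d+)c|(\d+)s\(([am])\)")


def parse_architecture(spec):
    """Turn e.g. '6c-3s(a)-12c-3s(m)' into a list of layer descriptions."""
    layers = []
    for token in spec.split("-"):
        match = LAYER.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized layer token: {token!r}")
        if match.group(1) is not None:
            layers.append(("conv", int(match.group(1))))
        else:
            pooling = "average" if match.group(3) == "a" else "max"
            layers.append(("pool", int(match.group(2)), pooling))
    return layers


best = parse_architecture("6c-3s(a)-12c-3s(m)")
```

For the best-performing configuration in Table 2, this yields a 6-map
convolution, a 3x3 average pooling layer, a 12-map convolution and a 3x3
max-pooling layer, in that order.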


3.3 Stacked Denoising Autoencoder 

A Stacked Denoising Autoencoder (SDAE) [?] consists of a combination of
multiple sparse autoencoders, which can be trained in a greedy layer-wise
fashion similar to that of Restricted Boltzmann Machines in a DBN. Each
autoencoder is associated with a set of weights and biases. In the SDAE, each
layer can be trained independently of the other layers. Once trained, the
parameters of an autoencoder are frozen in place. The training algorithm
comprises two phases: a forward pass phase and a backward pass phase. The
forward pass, also called the encoding phase, encodes raw image pixels into an
increasingly higher-order representation. The backward pass simply performs the
reverse operation by decoding these higher-order features into simpler
representations. The encoding step is given as:

$$ a^{(l)} = f(z^{(l)}) \tag{6} $$

$$ z^{(l+1)} = W^{(l,1)} a^{(l)} + b^{(l,1)} \tag{7} $$

And the decoding step is as follows:

$$ a^{(n+l)} = f(z^{(n+l)}) \tag{8} $$

$$ z^{(n+l+1)} = W^{(n-l,2)} a^{(n+l)} + b^{(n-l,2)} \tag{9} $$

Here a^{(l)} and z^{(l)} denote the activations and pre-activations of layer l,
f is the activation function, n is the number of stacked autoencoders, and
W^{(k,1)}, b^{(k,1)} (respectively W^{(k,2)}, b^{(k,2)}) are the encoding
(respectively decoding) weights and biases of the k-th autoencoder.

The hidden unit activations of the neurons in the deepest layer are used for 
classification after a supervised fine-tuning using backpropagation. 
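
The encoding and decoding passes of a stacked autoencoder can be sketched in a
few lines. This is an illustrative forward pass only (no denoising, no
training), assuming a logistic activation and a hypothetical layout in which
each layer is a `(weights, biases)` pair and the decoder layers are listed
deepest first.

```python
import math


def activation(z):
    """Element-wise logistic activation (an assumed choice of f)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in z]


def affine(W, a, b):
    """Compute W a + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias
            for row, bias in zip(W, b)]


def encode(x, encoder_layers):
    """Forward pass: each layer maps activations to a higher-order code."""
    a = x
    for W, b in encoder_layers:
        a = activation(affine(W, a, b))
    return a


def decode(code, decoder_layers):
    """Backward pass: decode the deepest code back towards the input space."""
    a = code
    for W, b in decoder_layers:
        a = activation(affine(W, a, b))
    return a


# Hypothetical 4 -> 3 -> 2 encoder and mirrored 2 -> 3 -> 4 decoder.
enc = [([[0.1] * 4] * 3, [0.0] * 3), ([[0.2] * 3] * 2, [0.0] * 2)]
dec = [([[0.2] * 2] * 3, [0.0] * 3), ([[0.1] * 3] * 4, [0.0] * 4)]
code = encode([1.0, 0.0, 1.0, 0.0], enc)
reconstruction = decode(code, dec)
```

The vector `code` plays the role of the deepest hidden activations that are
passed to the classifier after supervised fine-tuning.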











Figure 2: Schematic of the DeepSat classification framework


3.3.1 SDAE Results on SAT-4 & SAT-6 

Different network configurations were chosen for the SDAE in a manner similar
to that described above for DBN and CNN. The results are enumerated in Table 3.
Similar to DBN, each network is trained for a maximum of 500 epochs and the
lowest test error is considered for evaluation. As highlighted in the table,
networks with 5 layers and 100 neurons in each layer produce the best results on
both SAT-4 and SAT-6. It can be seen from the table that on both datasets, the
classifier accuracy initially improves and then drops with increasing number of
neurons and layers, similar to that of DBN. Also, the biggest networks with 500
and 2352 neurons in each layer exhibit a significant drop in classifier
accuracy.


Network Arch.             Classifier Accuracy   Classifier Accuracy
Neurons/layer [Layers]    SAT-4 (%)             SAT-6 (%)
----------------------------------------------------------------------
100 [1]                   75.88                 74.89
100 [2]                   76.854                76.12
100 [3]                   77.804                76.45
100 [4]                   78.674                76.52
100 [5]                   79.978                78.43
100 [6]                   75.766                76.72
500 [3]                   63.832                54.37
2352 [2]                  51.766                37.121

Table 3: Classification Accuracy of SDAE with various architectures on SAT-4 and
SAT-6


4 DeepSat - A Detailed Architectural Overview

Figure 2 schematically describes our proposed classification framework. Instead
of the traditional DBN model described in Section 3.1, which takes as input the
multi-channel image pixels reshaped as a linear vector, our classification
framework first extracts features from the image, which in turn are fed as input
to the DBN after normalizing the feature vectors.


4.1 Feature Extraction 


The feature extraction phase computes 150 features from the input imagery. The
key features that we use for classification are mean, standard deviation,
variance, 2nd moment, discrete cosine transforms, correlation, co-variance,
autocorrelation, energy, entropy, homogeneity, contrast, maximum probability and
sum of variance of the hue, saturation, intensity, and NIR channels as well as
those of the color co-occurrence matrices. These features were shown to be
useful descriptors for classification of satellite imagery in previous studies
([?], [33], [?]). Since two of the classes in SAT-4 and SAT-6 are trees and
grasslands, we incorporate features that are useful determinants for segregation
of vegetated areas from non-vegetated ones. The red band already provides a
useful feature for discrimination of vegetated and non-vegetated areas based on
chlorophyll reflectance; however, we also use derived features (vegetation
indices derived from spectral band combinations) that are more representative of
vegetation greenness. These include the Enhanced Vegetation Index (EVI [?]), the
Normalized Difference Vegetation Index (NDVI [?], [35]) and the Atmospherically
Resistant Vegetation Index (ARVI [?]).

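These indices are expressed as follows:

$$ \mathrm{EVI} = G \times \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + C_{red} \times \mathrm{Red} - C_{blue} \times \mathrm{Blue} + L} \tag{10} $$

Here, the coefficients G, C_{red}, C_{blue} and L are chosen to be 2.5, 6, 7.5
and 1, following those adopted in the MODIS EVI algorithm [?].

$$ \mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} \tag{11} $$

$$ \mathrm{ARVI} = \frac{\mathrm{NIR} - (2 \times \mathrm{Red} - \mathrm{Blue})}{\mathrm{NIR} + (2 \times \mathrm{Red} + \mathrm{Blue})} \tag{12} $$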
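
For a single pixel, the three indices can be computed directly from the band
values. The sketch below is only an illustration: the function names are
hypothetical, the inputs are assumed to be reflectances scaled to [0, 1], the
ARVI denominator follows Eq. (12) exactly as written in this paper, and zero
denominators are not handled.

```python
def evi(nir, red, blue, G=2.5, c_red=6.0, c_blue=7.5, L=1.0):
    """Enhanced Vegetation Index with the coefficients quoted in the text."""
    return G * (nir - red) / (nir + c_red * red - c_blue * blue + L)


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def arvi(nir, red, blue):
    """Atmospherically Resistant Vegetation Index, as printed in Eq. (12)."""
    return (nir - (2 * red - blue)) / (nir + (2 * red + blue))


# Hypothetical reflectances for a vegetated pixel.
nir, red, blue = 0.5, 0.1, 0.05
indices = (evi(nir, red, blue), ndvi(nir, red), arvi(nir, red, blue))
```

In the framework, such indices would be evaluated per pixel and then
summarized over each 28x28 patch; the exact aggregation is not specified in
the text.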


The performance of our learner depends to a large extent on the selected
features. Some features contribute more than others towards optimal
classification. The 150 features extracted are narrowed down to 22 using a
feature-ranking algorithm based on the Distribution Separability Criterion [?].
Details of the feature ranking method, along with the ranking of all 22 features
used in our framework, are listed in Section 6.1.1.
4.2 Data Normalization


The feature vectors extracted from the training and test datasets are separately
normalized to lie in the range [0,1]. This is done using the following equation:

$$ F_{normalized} = \frac{F - F_{min}}{F_{max} - F_{min}} \tag{13} $$

where F_{min} and F_{max} are computed for a particular feature type over all
images in the dataset.
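
A minimal sketch of this per-feature min-max scaling is given below. The
function name `min_max_normalize` is hypothetical, and mapping a constant
feature (where the maximum equals the minimum) to 0.0 is an assumed convention
that the text does not address.

```python
def min_max_normalize(features):
    """Scale each feature (column) of a list of feature vectors to [0, 1]."""
    columns = list(zip(*features))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for vector in features:
        row = []
        for value, lo, hi in zip(vector, mins, maxs):
            # Constant feature: avoid division by zero (assumed convention).
            row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(row)
    return normalized


# Hypothetical feature vectors for three images with two features each.
scaled = min_max_normalize([[1, 10], [3, 20], [5, 40]])
```

Following the text, the function would be called once on the training feature
vectors and once on the test feature vectors, so that each set is normalized
with its own minima and maxima.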


4.3 Classification 


The set of normalized feature descriptors extracted from the input image is fed
into the DBN, which is then trained using Contrastive Divergence in the same way
as explained in Section 3.1. Once trained, the DBN is used to initialize the
weights of a feedforward backpropagation neural network.

The neural network gives an estimate of the posterior probabilities of the class
labels given the input vectors, which are the feature vectors in our case. As
illustrated in [4], the outputs of a neural network obtained by optimizing the
sum-squared error function approximate the average of the class conditional
distributions of the target variables

$$ y_k(x) = \langle t_k | x \rangle = \int t_k \, p(t_k | x) \, dt_k \tag{14} $$

Here, t_k are the set of target values that represent the class membership of
the input vector x. For a binary classification problem, in order to map the
outputs of the neural network to the posterior probabilities of the labeling, we
use a single output y and a target coding that sets t^n = 1 if x^n is from class
C_1 and t^n = 0 if x^n is from class C_2. The target distribution would then be
given as

$$ p(t | x) = \delta(t - 1) P(C_1 | x) + \delta(t) P(C_2 | x) \tag{15} $$

Here, \delta denotes the Dirac delta function, which has the properties
\delta(x) = 0 if x \neq 0 and

$$ \int_{-\infty}^{\infty} \delta(x) \, dx = 1 \tag{16} $$

From (14) and (15) we get

$$ y(x) = P(C_1 | x) \tag{17} $$

So, the network output y(x) represents the posterior probability of the input
vector x having the class membership C_1, and the probability of the class
membership C_2 is given by P(C_2 | x) = 1 - y(x). This argument can easily be
extended to multiple class labels for a generalized multi-class classification
problem.
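
The step from the conditional average of the target to the posterior
probability can be written out explicitly. The derivation below is a worked
version of that step for the single-output binary case, using only the sifting
property of the Dirac delta function, \int g(t) \delta(t - t_0) dt = g(t_0).

```latex
\begin{align*}
y(x) &= \int t \, p(t \mid x) \, dt \\
     &= \int t \, \bigl[ \delta(t - 1) \, P(C_1 \mid x)
        + \delta(t) \, P(C_2 \mid x) \bigr] \, dt \\
     &= P(C_1 \mid x) \int t \, \delta(t - 1) \, dt
        + P(C_2 \mid x) \int t \, \delta(t) \, dt \\
     &= 1 \cdot P(C_1 \mid x) + 0 \cdot P(C_2 \mid x) \\
     &= P(C_1 \mid x)
\end{align*}
```

The second term vanishes because the integrand t \delta(t) is evaluated at
t = 0, which is why only the posterior of class C_1 survives.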
The feature extraction phase proves to be a useful dimensionality reduction
technique that significantly improves the discriminative power of the DBN-based
classifier.


5 Results and Comparative Studies 

The feature vectors extracted from the dataset are fed into DBNs with different
configurations. Since the feature vectors create a low-dimensional
representation of the data, DeepSat converges to high accuracy even with a much
smaller network with fewer layers and very few neurons per layer. This speeds up
network training by several orders of magnitude. Various network architectures
along with the classification accuracy for DeepSat on the SAT-4 and SAT-6
datasets are listed in Table 4. For regularization, we again use L2-norm
regularization. From the table, it is evident that the best performing DeepSat
network outperforms the best traditional Deep Learning approach (CNN) by ~11% on
the SAT-4 dataset and by ~15% on the SAT-6 dataset.
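
The quoted margins can be checked against the best rows of the CNN and DeepSat
result tables (86.827% and 79.063% for CNN, 97.946% and 93.916% for DeepSat).
Read as absolute differences in percentage points, they work out as follows:

```latex
\begin{align*}
\text{SAT-4:}\quad 97.946 - 86.827 &= 11.119 \\
\text{SAT-6:}\quad 93.916 - 79.063 &= 14.853
\end{align*}
```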

We also compare DeepSat with a Random Forest classifier to investigate the
advantages gained by unsupervised pre-training in DBN as opposed to the
traditional supervised learning in Random Forests. On SAT-4, the Random Forest
classifier produces an accuracy of 69%, while on SAT-6 it produces an accuracy
of 54%. The highest accuracy was obtained for a forest with 100 trees. Further
increase in the number of trees did not yield any significant improvement in
classifier accuracy. It can be easily seen that the various Deep architectures
produce better classification accuracy than the Random Forest classifier, which
relies solely on supervised learning.


Network Arch.             Classifier Accuracy   Classifier Accuracy
Neurons/layer [Layers]    SAT-4 (%)             SAT-6 (%)
----------------------------------------------------------------------
10 [2]                    96.585                91.91
10 [3]                    96.8                  87.716
20 [2]                    97.115                86.21
20 [3]                    95.473                93.42
50 [2]                    97.946                93.916
50 [3]                    97.654                92.65
100 [2]                   97.292                89.08
100 [3]                   95.609                91.057

Table 4: Classification Accuracy of DeepSat with various network architectures
on SAT-4 and SAT-6
(a) Distribution of NIR on the SAT-4 classes (axis label: normalized NIR values
of the 4 classes)
(b) Distribution of a sample DeepSat feature (autocorrelation of the Hue color
co-occurrence matrix) on the SAT-4 classes (axis label: normalized feature
values of the 4 classes)

Figure 3: Distributions of the raw NIR values for traditional Deep Learning
algorithms and a sample DeepSat feature for various classes on SAT-4 (best
viewed in color)

6 Why Traditional Deep Architectures are not enough for SAT-4 & SAT-6?

While traditional Deep Learning approaches have produced state-of-the-art
results for various pattern recognition problems like handwritten digit
recognition [?], object recognition [20], face recognition [33], etc., satellite
datasets have high intra- and inter-class variability, and the amount of labeled
data is much smaller than the total size of the dataset. Also, higher-order
texture features are a very important discriminative parameter for various
landcover classes. In contrast, the shape/edge-based features that are
predominantly learned by various Deep architectures are not very useful in
learning data representations for satellite imagery. This explains why
traditional Deep architectures are not able to converge to the global optimum
even for reasonably large and deep architectures.

Also, spatially contextual information is another important parameter for
modeling satellite imagery. In traditional Deep Learning approaches like DBN and
SDAE, the relative spatial information of the pixels is lost. As a result, the
orderless pool of pixel values which acts as input to the Deep Networks lacks
sufficient discriminative power to be well-represented even by very big and/or
deep networks. CNN, however, involves feature-pooling from a local spatial
neighborhood, which justifies its improved performance over the other two
algorithms on both SAT-4 and SAT-6. Even though our approach extracts an
orderless pool of feature vectors, the spatial context is already
well-represented in the individual feature values themselves. We substantiate
our arguments about the effectiveness of our feature extraction approach from a
statistical point of view as detailed in the analysis below.




Dataset   Input              Dist. b/w Means   Standard Deviations
---------------------------------------------------------------------
SAT-4     Raw Images         0.1994            0.1166
SAT-4     DeepSat Features   0.8454            0.0435
SAT-6     Raw Images         0.3247            0.1273
SAT-6     DeepSat Features   0.9726            0.0491

Table 5: Distance between Means and Standard Deviations for raw image values
and DeepSat feature vectors for SAT-4 and SAT-6


6.1 A Statistical Perspective based on Distribution Separability Criterion

Improving classification accuracy can be viewed as maximizing the separability
between the class-conditional distributions. Following the analysis presented in
[?], we can view the problem of maximizing distribution separability as
maximizing the distance between distribution means and minimizing their standard
deviations. Figure 3 shows the histograms that represent the class-conditional
distributions of the NIR channel and a sample feature extracted in the DeepSat
framework. As illustrated in Table 5, the features extracted in DeepSat have a
higher distance between means and a lower standard deviation as compared to the
original image distributions, thereby ensuring better class separability.


6.1.1 Feature Ranking 


Following the analysis proposed in Section 6.1 above, we can derive a metric for
the Distribution Separability Criterion as follows:

$$ D_s = \frac{\| \delta_{mean} \|}{\delta_{\sigma}} \tag{18} $$

where \|\delta_{mean}\| indicates the mean of the distances between means and
\delta_{\sigma} indicates the mean of the standard deviations of the class
conditional distributions. By maximizing D_s over the feature space, a feature
ranking can be obtained. Table 6 shows the ranking of the various features used
in our framework along with the values of the corresponding distance between
means \|\delta_{mean}\|, standard deviation \delta_{\sigma} and Distribution
Separability Criterion D_s.
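
For a scalar feature, the criterion and the resulting ranking can be sketched
as follows. This is one possible reading of the definition, not the exact
procedure of the paper: the distance between means is taken as the mean
absolute difference over all pairs of classes, the population standard
deviation is used, and the function names and sample values are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def separability(samples_by_class):
    """D_s = mean pairwise distance between class means / mean class std."""
    class_means = [mean(values) for values in samples_by_class.values()]
    class_stds = [pstdev(values) for values in samples_by_class.values()]
    distances = [abs(m1 - m2) for m1, m2 in combinations(class_means, 2)]
    return mean(distances) / mean(class_stds)


def rank_features(features):
    """Sort feature names by decreasing D_s."""
    scores = {name: separability(samples) for name, samples in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized feature values for two classes.
features = {
    "well_separated": {"trees": [0.1, 0.2, 0.3], "water": [0.7, 0.8, 0.9]},
    "overlapping": {"trees": [0.4, 0.5, 0.6], "water": [0.45, 0.55, 0.65]},
}
ranking = rank_features(features)
```

A feature whose class-conditional distributions are far apart and narrow
receives a large D_s and is ranked first; keeping the top-ranked entries gives
a reduced feature set of the kind described in Section 4.1.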


6.1.2 Distribution Separability and Classifier Accuracy 

In order to analyze the improvements achieved in the learning framework due to
the feature extraction step, we measured the Distribution Separability of the
mean activation of the neurons in each layer of the DBN and that of DeepSat. The
results are shown in Figure 4. It can be seen that the mean activations learned
by each layer of DeepSat exhibit a significantly higher distribution
separability (by several orders of magnitude) than the neurons of a DBN. This
justifies the significant improvement in performance of DeepSat (using the
features) as compared to the DBN-based framework (using the raw pixel values as
input). Also, a comparison of Figure 4 with Table 1 and Table 4 shows that the
distribution separabilities using the various architectures of the DBN and
DeepSat are positively correlated with the final classifier accuracy. This
justifies the effectiveness of our distribution separability metric D_s as a
measure of the final classifier accuracy.


Rank   Feature              ||delta_mean||   delta_sigma   D_s
------------------------------------------------------------------
1      I CCM mean           0.4031           0.1371        2.9403
2      H CCM sosvh          0.2359           0.0928        2.5413
3      H CCM autoc          0.2334           0.1090        2.1417
4      S CCM mean           0.0952           0.0675        1.4099
5      H CCM mean           0.0629           0.0560        1.1237
6      SR                   0.0403           0.0428        0.9424
7      S CCM 2nd moment     0.0260           0.0312        0.8354
8      I CCM 2nd moment     0.0260           0.0312        0.8354
9      I 2nd moment         0.0260           0.0312        0.8345
10     I variance           0.0260           0.0312        0.8345
11     NIR std              0.0251           0.0315        0.7980
12     I std                0.0251           0.0314        0.7968
13     H std                0.0252           0.0317        0.7956
14     H mean               0.0240           0.0314        0.7632
15     I mean               0.0254           0.0336        0.7541
16     S mean               0.0232           0.0319        0.7268
17     I CCM covariance     0.0378           0.0522        0.7228
18     NIR mean             0.0246           0.0351        0.6997
19     ARVI                 0.0229           0.0345        0.6622
20     NDVI                 0.0215           0.0326        0.6594
21     DCT                  0.0344           0.0594        0.5792
22     EVI                  0.0144           0.0450        0.3207

Table 6: Ranking of features based on Distribution Separability Criterion for
SAT-6
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(a) Distribution Separability Criterion of (b) Distribution Separability Criterion of 
DBN DeepSat 

Figure 4: Distribution Separability Criterion of the neurons in the layers of a DBN 
and DeepSat with various architectures on SAT-6 

7 What is the difference between MNIST, CIFAR-10 and 
SAT-6 in terms of dimensionality? 

We argue that handwritten digit datasets like MNIST and object recognition datasets
like CIFAR-10 lie on a much lower dimensional manifold than the airborne SAT-6
dataset. Hence, even though Deep Neural Networks can effectively classify the raw
feature space of object recognition datasets, the dimensionality of the airborne
image datasets is such that Deep Neural Networks cannot classify them effectively.
In order to estimate the dimensionality of the datasets, we use the concept of
intrinsic dimension [?].

7.1 Intrinsic Dimension Estimation using the DANCo algorithm

To estimate the intrinsic dimension of a dataset, we use the DANCo algorithm
[?]. It uses the complementary information provided by the normalized nearest
neighbor distances and angles calculated on pairs of neighboring points.

Taking 10 rounds of 1000 random samples and averaging, we obtain the intrinsic
dimension for the MNIST, CIFAR-10 and SAT-6 datasets and the Haralick
features extracted from the SAT-6 dataset. The results are listed in Table 7.

As Table 7 shows, the intrinsic dimensionality of the SAT-6 dataset (115) is roughly
seven times that of MNIST (16) and CIFAR-10 (17). A deep neural network therefore
finds it difficult to classify the SAT-6 dataset because of its intrinsically high
dimensionality. However, the features extracted from SAT-6 have a much lower
intrinsic dimensionality (4.2) and lie on a much lower dimensional manifold than
the raw vectors, and hence can be classified even by networks with relatively
small architectures.

Dataset                                   Intrinsic Dimension
----------------------------------------  -------------------
MNIST                                     16
CIFAR-10                                  17
SAT-6                                     115
Haralick Features extracted from SAT-6    4.2

Table 7: Intrinsic Dimension estimation using DANCo on the MNIST, CIFAR-10,
and SAT-6 datasets and the Haralick features extracted from the SAT-6 dataset.
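The sampling protocol of Section 7.1 (several rounds of random subsamples, averaged) can be sketched as follows. This sketch does not implement DANCo: as a simpler stand-in it uses the nearest-neighbor maximum likelihood estimator of Levina and Bickel (with the k-2 normalization), which relies on neighbor distances only and ignores the angle information that DANCo adds. The function names and the neighborhood size `k=10` are assumptions made for illustration, and the data is assumed to contain no duplicate points.

```python
import numpy as np


def mle_intrinsic_dimension(points, k=10):
    """Nearest-neighbor MLE of intrinsic dimension (stand-in for DANCo)."""
    points = np.asarray(points, dtype=float)
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    dist = np.sqrt(np.maximum(d2, 0.0))
    # After sorting each row, column 0 is the point itself; keep the k nearest.
    knn = np.maximum(np.sort(dist, axis=1)[:, 1:k + 1], 1e-12)
    # log(T_k / T_j) for j = 1 .. k-1, one row per point.
    log_ratio = np.log(knn[:, -1:] / knn[:, :-1])
    per_point = (k - 2) / np.sum(log_ratio, axis=1)
    return float(np.mean(per_point))


def averaged_estimate(data, rounds=10, sample_size=1000, k=10, seed=0):
    """Average the estimate over several random subsamples of the data."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    size = min(sample_size, len(data))
    estimates = []
    for _ in range(rounds):
        idx = rng.choice(len(data), size=size, replace=False)
        estimates.append(mle_intrinsic_dimension(data[idx], k=k))
    return float(np.mean(estimates))


if __name__ == "__main__":
    # Synthetic check: a 3-dimensional subspace embedded in 20 dimensions.
    rng = np.random.default_rng(1)
    data = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 20))
    print(averaged_estimate(data, rounds=3, sample_size=500))
```

On the synthetic data the estimate should come out close to the true dimension of 3 even though the points live in a 20-dimensional space, which is the same effect that separates the raw SAT-6 vectors from the extracted features in Table 7.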


7.2 Visualizing Data in an n-dimensional space 

We can visualize the data as distributed in an n-dimensional unit hypersphere.
The volume of a hypersphere of radius $r$, with $r = 1$ for the unit hypersphere, is

$$V_{sphere} = \frac{\pi^{n/2}\, r^n}{\Gamma\left(\frac{n}{2} + 1\right)} = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \tag{19}$$

for n-dimensional Euclidean space, where $\Gamma$ is Euler's gamma function. Now, the
total volume of the n-dimensional space can be accounted for by the volume of an
n-dimensional hypercube of side length $R = 2$ embedding the hypersphere, i.e. the
volume of the n-cube is

$$V_{cube} = R^n = 2^n \tag{20}$$

So, the relative fraction of the data points which lie in the sphere as compared to
the data points in the n-dimensional embedding space is given as

$$V_{relative} = \frac{V_{sphere}}{V_{cube}} = \frac{\pi^{n/2}}{2^n\, \Gamma\left(\frac{n}{2} + 1\right)} \tag{21}$$

$$V_{relative} \to 0 \quad \text{as} \quad n \to \infty \tag{22}$$

This means that as the dimensionality of the sample data approaches $\infty$, the
spread or scatter of the data points approaches 0 with respect to the total search
space. As a result, various classification and clustering algorithms lose their
discriminative power in higher dimensional feature spaces.
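To make the limit in this section concrete, the following minimal sketch evaluates the ratio of the unit hypersphere volume to the volume of the enclosing hypercube of side 2 for a few dimensions. The computation is done in log space with `math.lgamma` to avoid overflow of the gamma function; the function names are hypothetical, and the dimensions in the loop are chosen for illustration only (16, 17 and 115 are the intrinsic dimensions reported in Table 7).

```python
import math


def log_relative_volume(n):
    """Natural log of V_sphere / V_cube for the unit n-sphere in a cube of side 2."""
    return (0.5 * n * math.log(math.pi)
            - n * math.log(2.0)
            - math.lgamma(0.5 * n + 1.0))


def relative_volume(n):
    """V_sphere / V_cube = pi^(n/2) / (2^n * Gamma(n/2 + 1))."""
    return math.exp(log_relative_volume(n))


if __name__ == "__main__":
    for n in (1, 2, 3, 16, 17, 115):
        print(f"n = {n:3d}  V_relative = {relative_volume(n):.3e}")
```

For n = 2 the ratio is pi/4 (a disc inside a square) and for n = 3 it is pi/6; beyond that it decays faster than exponentially, which is the behavior stated in the limit above.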


8 Related Work 

Present classification algorithms used for Moderate-resolution Imaging
Spectroradiometer (MODIS) (500-m) [?] or Landsat (30-m) based land cover maps like
NLCD [?] produce accuracies of 75% and 78% respectively. The relatively lower
resolution of the datasets makes it difficult to analyze the performance of these
algorithms for 1-m imagery. A method based on object detection using a Bayes
framework and subsequent clustering of the objects using Latent Dirichlet Allocation
was proposed in [?]. However, their approach detects object groups at a higher level
of abstraction, like parking lots. Detecting objects like cars or trees in itself is
not addressed in their work. A deep convolutional hierarchical framework was
proposed recently by [?]. However, they report results on the AVIRIS Indiana's
Indian Pines test site. The spatial resolution of the dataset is limited to 20 m and
it is difficult to evaluate the performance of their algorithm for object recognition
tasks at a higher resolution. An evaluation of various feature learning strategies was
done in [?]. They evaluated both feature extraction techniques and classifiers like
DBN and Random Forest for various aerial datasets. However, since the training
data was significantly limited, the DBN was not able to produce any improvements
over Random Forest even when raw pixel values were fed into the classifier. In
contrast, our study shows that DBNs can be better classifiers when there is a
significant amount of training data to initialize the neural network at a global error
basin.


9 Conclusions and Future Directions 

Our semi-supervised learning framework produces an accuracy of 97.95% and
93.9% on the SAT-4 and SAT-6 datasets and significantly outperforms the
state-of-the-art by ~11% and ~15% respectively. The feature extraction phase is
inspired by the remote sensing literature and significantly improves the
discriminative power of the framework. For satellite datasets, with inherently high
variability, traditional deep learning approaches are unable to converge to a global
optimum even with significantly large and deep architectures. A statistical analysis
based on the Distribution Separability Criterion justifies the effectiveness of our
feature extraction approach.

We plan to investigate the use of various pooling techniques like SPM [21] as
well as certain sparse representations like sparse coding [24] and hierarchical
representations like Convolutional DBN [?] to handle satellite datasets. We believe
that SAT-4 and SAT-6 will enable researchers to learn better representations for
satellite datasets and create benchmarks for the classification of satellite imagery.